LATE FUSION AND CALIBRATION FOR MULTIMEDIA EVENT DETECTION
USING FEW EXAMPLES


Julien van Hout1, Eric Yeh1, Dennis C. Koelma2, Cees G.M. Snoek2,
Chen Sun3, Ramakant Nevatia3, Julie Wong1, Gregory K. Myers1


1SRI International, Menlo Park, USA
2University of Amsterdam, The Netherlands
3University of Southern California, Los Angeles, USA


ABSTRACT 

The state of the art in example-based multimedia event detection
(MED) rests on heterogeneous classifiers whose scores are typically
combined in a late-fusion scheme. Recent studies on this topic have
failed to reach a clear consensus as to whether machine learning
techniques can outperform rule-based fusion schemes with varying
amounts of training data. In this paper, we present two parametric
approaches to late fusion: a normalization scheme for arithmetic
mean fusion (logistic averaging) and a fusion scheme based on
logistic regression, and compare them to widely used rule-based fusion
schemes. We also describe how logistic regression can be used to
calibrate the fused detection scores to predict an optimal threshold
given a detection prior and costs on errors. We discuss the
advantages and shortcomings of each approach when the number of
positives available for training varies from 10 positives (10Ex) to 100
positives (100Ex). Experiments were run using video data from the
NIST TRECVID MED 2013 evaluation, and results are reported in
terms of a ranking metric, the mean average precision (mAP), and
$R_0$, a cost-based metric introduced in TRECVID MED 2013.

Index  Terms —  multimedia  event  detection,  late  fusion,  score 
calibration,  score  normalization,  system  fusion 

1.  INTRODUCTION 

As the quantity of online user-submitted multimedia content grows,
indexing and reliably searching for specific content becomes
increasingly challenging. Moreover, the data is very heterogeneous
and often of poor audio or visual quality, which challenges the
accuracy of current event detection technologies. The multimedia
event detection track conducted under the TRECVID evaluations
by NIST aims to solve the problem of detecting specific events
like “changing a tire” or “grooming an animal” in a heterogeneous
corpus of user-submitted video clips. Accurately detecting such
precise events requires input from various analysis channels (image
and motion analysis, audio concepts, speech content, character
recognition, etc.) that we will refer to as MED modalities. The
best-performing approaches to this task use various modalities
and combine their detection scores in a late-fusion scheme. While
some researchers have successfully developed early-fusion schemes
[1, 2] that combine different modalities and learn joint classifiers,
not all modalities can be combined in this way, and these systems
still rely heavily on late fusion as a final combination stage [2].

Approaches to late fusion mainly fall into two categories:
rule-based or statistical-model-based. Simple rule-based fusions
(like arithmetic mean or geometric mean), which first normalize
the scores to a comparable range and then treat each modality
identically, are popular for their inherent robustness to over-fitting
[3, 4, 5]. Other rule-based techniques (like weighted averaging)
use different weights for different modalities. Here, the weights
are found using grid search [6, 7], are set to a measure of the
performance of each modality [3], or are set to a measure of
confidence of the score [8]. Machine learning alternatives like
logistic regression [5, 9], ridge regression [7], linear support vector
machine (SVM) [9], and explicit optimization of an evaluation metric
[9] have also been explored. While the above studies often compare
multiple fusion techniques to one another, their conclusions can vary
widely. For instance, [9, 6] claim gains from logistic regression
fusion compared to arithmetic mean or grid-search techniques, while
other studies reached the opposite conclusion on a similar task
[3, 4, 5]. Given the diversity of the modalities to fuse across
research teams, and the fact that their scores show very different
distributions (Gaussian, exponential, bimodal, etc.), we believe that
such conflicting conclusions could be explained by differences in the
modalities' score distributions, differences in the type of features
used in learning-based techniques, and differences in the way missing
values are handled. Unfortunately, these details are often overlooked
in the above studies, making it difficult to draw definite conclusions.

The  challenge  of  handling  missing  values  in  late  fusion  is  very 
common  in  detection  tasks,  especially  in  MED  where  modalities’ 
scores  can  go  missing  for  various  reasons:  no  audio  was  available, 
no  speech  was  detected,  no  motion  was  detected  in  the  video,  etc. 
Traditional  ways  of  dealing  with  missing  features  in  late-fusion 
include  inferring  the  missing  scores  from  the  mean  of  scores  from 
other  videos  or  setting  the  missing  score  to  be  the  minimum  score. 
Ideally,  one  would  like  to  not  make  any  assumption  about  the 
missing  score’s  value  but  rather  learn  its  value  for  various  events 
and  modalities.  Such  an  approach  has  been  successfully  applied  to 
other  detection  tasks  such  as  speaker  identification  [10]  or  spoken 
term  detection  [11]  by  using  a  logistic  regression  framework  with 
binary  side-information. 

While MED performance is usually measured in terms of
mean average precision (mAP), we also considered the $R_0$ metric
introduced in TRECVID MED 2013. $R_0$ can be interpreted as a risk
based on costs of misses and false alarms that the system should
minimize by picking the right threshold. The main challenge when
optimizing such a metric is to properly calibrate the scores such that
a good threshold can be chosen. Also, a fusion strategy that gives
the best mAP might not be optimal in terms of $R_0$, since the two
metrics target different use cases. Prior work in speaker detection
[12] has found logistic regression to be a very effective approach to
both calibration and fusion over a wide range of operating points.

In this paper, we introduce a late-fusion framework based
on logistic regression that handles missing features as binary
side-information. We also introduce a novel discriminative
normalization scheme for arithmetic mean fusion, called logistic
averaging, that is robust to a limited number of training examples.
Finally, we present a strategy to calibrate the final scores and pick
optimal thresholds for $R_0$, and report MED results for both the mAP
and $R_0$ metrics.

2.  DESCRIPTION  OF  MODALITIES 

In this section, we describe the scope of our individual modalities
and how they were trained. A more in-depth description of each
modality can be found in [13].

Low-level visual features. We extracted low-level visual
features at two frames per second from each video. We followed
the bag-of-codes approach, which considers spatial sampling of
points of interest, visual description of those points, and encoding
of the descriptors into visual codes. We used a mixture of SIFT,
TSIFT, and C-SIFT descriptors [14]. We computed the descriptors
around points obtained from dense sampling and reduced them to
80 dimensions with principal component analysis. We encoded
the color descriptors using Fisher vectors with a Gaussian Mixture
Model codebook of 256 elements [15].

Semantic visual features. We detected semantic concepts
for each frame using low-level visual features and following the
approach in [16]. We trained 1,346 concept detectors based on
linear SVMs. Each frame is then represented by the concatenated
detector scores from all these concepts.

Visual event classifiers. We included three visual event
classifiers based on low-level and semantic features. To arrive at a
video-level representation for the low-level visual event classifier,
we relied on simple frame averaging. For the two video event
classifiers based on semantic features, we aggregated the concept
vectors per frame into a video-level representation. One approach
used averaging and normalization, while the other method used
semantic encoding. On top of both concept representations, we used
an SVM with a $\chi^2$ kernel.

Low-level motion features. The two low-level motion features
were based on Dense Trajectories (DTs) [17] and MoSIFT [18].
We computed DT raw features with a step size of 10 pixels and
MoSIFT raw features with default parameters. The raw features
were encoded using first- and second-order Fisher Vector descriptors
with a two-level spatial pyramid [19]. Descriptors were aggregated
across each video. We generated four event classifiers: two with
DT features using first- and second-order Fisher Vector descriptors,
and two with MoSIFT features using first- and second-order Fisher
Vector descriptors. A Gaussian-kernel SVM was used for
classification, and the outputs from the same low-level feature were
averaged.

Motion event classifiers. Two event classifiers were generated
based on action concept detectors. There were 96 action concepts
annotated on the MED 11 Event Kit provided by Sarnoff/UCF,
and 101 action concepts from UCF 101 [20]. The action concept
detectors were applied to small segments of videos and encoded by
Hidden Markov Model Fisher Vector descriptors [21]. An SVM with a
Gaussian kernel was used to train two event classifiers, one for each
set of action concepts.

Low-level audio content. For our audio features, we extracted
mel-frequency cepstral coefficients (MFCCs) over a 10-ms window.
MFCCs describe the spectral shape of audio. The first and second
derivatives of the MFCCs were also computed. The MFCC
features were difference-coded with Fisher vectors using a
1024-element Gaussian Mixture Model and classified using a linear SVM.

Spoken content. We ran an English ASR model trained on
conversational telephone data and adapted to meeting data. We
performed supervised acoustic model adaptation using in-domain
annotated TRECVID data and unsupervised adaptation using the
first-pass recognition output. We also performed supervised and
unsupervised language model adaptation. The lattice-based approach
described in [22] was used to build the MED classifier, and the final
score was the distance to the hyperplane of an L1-regularized linear
SVM model, mapped to [0, 1] by using a logistic function.

Written content. SRI’s English video OCR software detected
and recognized text appearing in the TRECVID MED 2013 video
imagery. This software recognizes both overlay text, such as
captions that appear on broadcast news programs, and in-scene text on
signs or vehicles [23]. For each event, we generated event profiles
from the event descriptions by using term frequency-inverse
document frequency (TF-IDF) weightings to rank the relevance of
non-stop-words. The event detection score for each video was the cosine
similarity between the word vector for the video and the word vector
for the event profile.

3.  LATE  FUSION 

In this section, we present the approaches to late fusion that were
used in our experiments. The scores $x_i$ from each of the N
modalities are detection probabilities and therefore lie in [0, 1].
The goal of late fusion is to estimate the probability of the label y
of a given video given the score vector
$\mathbf{x} = [x_0, x_1, \ldots, x_{N-1}]$.

3.1.  Baseline  fusions 

We  describe  a  few  simple,  widely  used  baseline  fusion  methods. 

Arithmetic mean. This method combines scores from various
modalities by taking the arithmetic mean of the scores for each trial.
We considered two ways of dealing with missing scores. In the
AM-zero technique, a missing score is assumed to have a value of zero.
In the AM-mean technique, a missing score is assumed to have the
mean value of the non-missing scores from the other modalities. The
latter technique is equivalent to computing the average over the
non-missing scores only.

Geometric mean. This method, referred to as GM, computes the
fused score for a given trial as the geometric mean of all non-missing
and non-zero scores for that trial.
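
As a rough illustration of the three rule-based baselines described
above, the following Python sketch fuses the scores of a single trial.
It assumes that a missing score is represented by None and that a trial
with no usable score fuses to 0.0; both conventions, and the function
names, are illustrative choices and are not taken from the system
described in this paper.

```python
import math

def am_zero(scores):
    """Arithmetic mean where a missing score (None) counts as 0."""
    return sum(0.0 if s is None else s for s in scores) / len(scores)

def am_mean(scores):
    """Arithmetic mean over the non-missing scores only."""
    present = [s for s in scores if s is not None]
    # Assumption: a trial with no score at all fuses to 0.0.
    return sum(present) / len(present) if present else 0.0

def geometric_mean(scores):
    """Geometric mean over the non-missing, non-zero scores."""
    present = [s for s in scores if s is not None and s > 0.0]
    if not present:
        return 0.0  # assumption, as above
    return math.exp(sum(math.log(s) for s in present) / len(present))

# One trial with four modalities: the second is missing, the last is 0.
trial = [0.8, None, 0.2, 0.0]
fused_am_zero = am_zero(trial)
fused_am_mean = am_mean(trial)
fused_gm = geometric_mean(trial)
```

On this toy trial the three rules disagree noticeably, which is one
reason the handling of missing and zero scores matters when comparing
fusion schemes.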

Weighted averaging. In this technique, the final score is
computed as a weighted sum of the scores for each modality.
The weights are chosen by optimizing the mAP metric
through an exhaustive grid search with weights taking values in
{0.001, 0.01, 0.03, 0.1, 0.3, 0.6, 1}. The weights are trained on the
cross-validation scores on the training data and applied to the test
data. We study two setups with different numbers of trained
parameters: in WM-dep the weights are event-dependent, and in
WM-indep the weights are optimized for all 20 events at once.
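
The grid search can be sketched in a few lines of Python. The sketch
below handles a single event (the WM-dep setting) and scores each
weight combination by average precision; the toy data, the helper
names, and the tie-breaking rule (the first best combination in
iteration order is kept) are assumptions made for illustration only.

```python
from itertools import product

GRID = (0.001, 0.01, 0.03, 0.1, 0.3, 0.6, 1.0)

def average_precision(scores, labels):
    """Mean of the precision values measured at each positive's rank."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def grid_search_weights(rows, labels, grid=GRID):
    """Try every weight combination and keep the one with the best AP."""
    best_w, best_ap = None, -1.0
    for weights in product(grid, repeat=len(rows[0])):
        fused = [sum(w * x for w, x in zip(weights, row)) for row in rows]
        ap = average_precision(fused, labels)
        if ap > best_ap:
            best_w, best_ap = weights, ap
    return best_w, best_ap

# Toy cross-validation scores: modality 0 is informative, modality 1 is not.
rows = [[0.9, 0.2], [0.7, 0.9], [0.4, 0.8], [0.3, 0.1], [0.2, 0.7]]
labels = [1, 1, 0, 0, 0]
best_w, best_ap = grid_search_weights(rows, labels)
```

For WM-indep, the same loop would score each weight combination by the
mean of the per-event average precisions instead of a single AP. Note
that the number of combinations grows as 7 to the power of the number
of modalities, which is one practical reason to merge similar
modalities before running the search.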

3.2.  Logistic  regression  fusion 

Logistic regression is a common approach for converting an
M-dimensional vector of scores into a single value, the likelihood
ratio, which can be used to make binary decisions. The LR model
assumes that the posterior of a certain clip being a positive has the
form $P(y = +1 \mid \mathbf{x}) = \sigma(\boldsymbol{\alpha}\mathbf{x}^T + \beta)$,
where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic function. The
parameters $\boldsymbol{\alpha} = [\alpha_0, \ldots, \alpha_{M-1}]$ and
$\beta$ are learned by maximizing the $L_2$-regularized likelihood of
the model on labeled training data by using the “Trust Region Newton
Method” [24] as implemented in the scikit-learn library [25].
The regularization parameter was tuned using cross-validation.

We propose to apply logistic regression to MED late fusion, a
technique we refer to as LR, as follows. For each trial, we create
a feature vector by concatenating the logit of the scores of all
N modalities, where the logit function is defined by
$\mathrm{logit}(x) = \log(x/(1 - x))$. The logit expands the dynamic
range of the exponentially distributed probabilistic scores. The
resulting scores are close to normally distributed for both positives
and negatives and behave better for logistic regression. Missing
scores are set to zero, and a feature is added for each modality as a
binary indicator variable $I_{miss}$ accounting for the possibility of
missing scores for some trials. Initially introduced in [11] for late
fusion of keyword spotting systems, this approach is equivalent to
learning a bias for the missing score value of each modality. Once the
parameters are trained, the final posterior is given by:

$$P(y = +1 \mid \mathbf{x}) = \sigma\left( \sum_{i=0}^{N-1} \left( \alpha_{2i} x_i + \alpha_{2i+1} I_{miss}(x_i) \right) + \beta \right)$$
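
To make the feature layout concrete, the Python sketch below builds
the interleaved feature vector (logit score, missing indicator) for
one trial and evaluates the posterior for given parameters. It assumes
that a missing score is represented by None and that scores are
clipped by a small eps before taking the logit so that 0 and 1 stay
finite; the clipping, the function names, and the example weights are
illustrative assumptions, not details from the paper.

```python
import math

def logit(p, eps=1e-6):
    """log(p / (1 - p)); eps clipping is an assumption of this sketch."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lr_features(scores):
    """Per modality: [logit(score), 0] if present, [0, 1] if missing."""
    feats = []
    for s in scores:
        feats += [0.0, 1.0] if s is None else [logit(s), 0.0]
    return feats

def lr_posterior(scores, alpha, beta):
    """sigma(sum_i(alpha[2i] * x_i + alpha[2i+1] * I_miss(x_i)) + beta)."""
    feats = lr_features(scores)
    return sigmoid(sum(a * f for a, f in zip(alpha, feats)) + beta)

# Two modalities, the second one missing. With these made-up weights the
# missing modality contributes exactly its indicator weight (-2.0).
features = lr_features([0.5, None])
posterior = lr_posterior([0.5, None], alpha=[1.0, 0.0, 1.0, -2.0], beta=0.0)
```

Because a missing score contributes 0 through its score weight and 1
through its indicator weight, the indicator weight plays the role of a
learned replacement value for the missing score.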

Additionally, we propose to automatically perform feature
selection and discard some modalities during training by looking at
the trained weights $\alpha_{2i}$. If the weight corresponding to a
certain modality is found to be negative, the logistic regression is
retrained with that modality removed. This approach is based on the
intuition that a negative weight indicates an anti-correlation between
the score of some modality and the label of a video clip, which is the
sign of a poorly performing modality. By discarding that modality, we
reduce the noise in the data as well as the dimension of the feature
vector and obtain better generalization properties. This approach will
be referred to as LR+fs. We also considered the LR-min and LR-min+fs
systems, where a missing score is set to the minimum score of that
modality on the training data. These two systems provide a
comparison point against the proposed missing-value handling scheme.
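
The retraining loop behind LR+fs can be sketched as follows. To keep
the example self-contained, the model is fitted with plain batch
gradient descent rather than the trust region Newton solver used in
the paper, each column is a single modality's logit-score without
missing-value indicators, and all modalities with a negative weight
are dropped in one pass before retraining. These three choices, the
hyperparameters, and the toy data are assumptions of this sketch.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_lr(X, y, l2=0.01, lr=0.5, steps=500):
    """L2-regularized logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [l2 * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j] / n
            gb += err / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def fit_with_selection(X, y):
    """Refit without any modality whose learned weight is negative."""
    keep = list(range(len(X[0])))
    while True:
        w, b = fit_lr([[row[j] for j in keep] for row in X], y)
        negative = {j for j, wj in zip(keep, w) if wj < 0.0}
        if not negative:
            return keep, w, b
        keep = [j for j in keep if j not in negative]

# Toy logit-scores: modality 0 agrees with the labels, modality 1 is
# anti-correlated with them and should be discarded.
X = [[2.0, -1.0], [1.5, -0.5], [1.0, -1.5],
     [-1.0, 1.0], [-2.0, 0.5], [-1.5, 1.5]]
y = [1, 1, 1, 0, 0, 0]
keep, w, b = fit_with_selection(X, y)
```

The loop always terminates because the set of kept modalities shrinks
at every retraining; in the degenerate case where every modality is
dropped, the model reduces to a bias-only classifier.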

3.3.  Logistic  averaging 

The logistic averaging technique, or LA, is a novel technique that
non-linearly normalizes the scores of various modalities before
performing arithmetic mean fusion.

As in the case of logistic regression, we apply the logit function
to the posterior scores of our modalities to map them from [0, 1] to
$[-\infty, +\infty]$. We apply Z-normalization by computing the means
and variances of the cross-validated logit-scores for each event on
the training data. The same normalization is applied to the videos
in the test set. Then we map those scores back to [0, 1] using the
sigmoid function $\sigma_{\alpha,\beta}$ defined as
$\sigma_{\alpha,\beta}(x) = 1/(1 + e^{-(\alpha x + \beta)})$. The
parameters $\alpha$ and $\beta$ are chosen to optimize the mAP on the
cross-validated training scores using a grid search. If $x_i$ denotes
the Z-normalized logit-score from each modality, then the fused score
is given by:


$$P(y = +1 \mid \mathbf{x}) = \frac{1}{N} \sum_{i=0}^{N-1} \sigma_{\alpha,\beta}(x_i)$$
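
The steps of logistic averaging can be sketched in Python as follows.
The sketch assumes that every modality has a score for the trial (the
handling of missing scores is not covered here), uses the population
variance for the Z-normalization, and clips scores by a small eps
before the logit; these choices, the function names, and the toy
training scores are illustrative assumptions.

```python
import math

def logit(p, eps=1e-6):
    """log(p / (1 - p)); eps clipping is an assumption of this sketch."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def zstats(values):
    """Mean and standard deviation (population form) of logit-scores."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var)

def logistic_average(scores, stats, alpha, beta):
    """Mean over modalities of sigma(alpha * z_i + beta)."""
    total = 0.0
    for s, (mu, sd) in zip(scores, stats):
        z = (logit(s) - mu) / sd          # Z-normalized logit-score
        total += 1.0 / (1.0 + math.exp(-(alpha * z + beta)))
    return total / len(scores)

# Toy cross-validated training scores for two modalities of one event.
train = [[0.2, 0.5, 0.8], [0.2, 0.5, 0.8]]
stats = [zstats([logit(s) for s in column]) for column in train]

fused_mid = logistic_average([0.5, 0.5], stats, alpha=1.0, beta=0.0)
fused_high = logistic_average([0.8, 0.8], stats, alpha=1.0, beta=0.0)
fused_low = logistic_average([0.2, 0.2], stats, alpha=1.0, beta=0.0)
```

Only the two scalars alpha and beta depend on labeled positives, and
in the setup described here they are shared across events, which keeps
the number of label-dependent parameters very small.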

As mentioned in previous work on late fusion of biometric
systems [26], Z-normalization performs best when the input
distributions are Gaussian. By applying the logit to our initial
scores, which are exponentially distributed, we obtain near-Gaussian
distributed scores. The role of $\alpha$ and $\beta$ is to enable some
non-linearity in the arithmetic mean fusion by tuning a sigmoid that
modifies the modalities’ score distributions. Because the logit-scores
are normalized around 0 with variance 1, a small $\alpha$ would lead to
a nearly linear mapping, while a large $\alpha$ introduces a sharp
cutoff at $-\beta$, below which the scores are set to 0 and above which
the scores are set to 1.

Though this approach does not train different weights for each
modality and can therefore seem sub-optimal compared to weighted
averaging or logistic regression, it is less prone to over-fitting, as
it does not rely on labeled positives to estimate the means and
variances. It does require some positives to tune $\alpha$ and $\beta$,
but as these parameters are fixed for all events, their estimation is
quite robust.

4.  CALIBRATION  AND  THRESHOLD  SELECTION 

Besides maximizing average precision, a second challenge of the
MED 2013 TRECVID evaluation [27] is to select, for each event,
the detection threshold t that maximizes the $R_0$ metric, defined as
$R_0(t) = Rec(t) - \frac{K}{V} Rank(t)$, where K = 12.5, V is the total
number of clips in the test set, Rec(t) is the recall at threshold t,
and Rank(t) is the number of clips whose score is larger than t. It can
be shown that $R_0(t)$ can be rewritten as:

$$R_0(t) = C_1 \left( C_2 - \left[ (K\pi_+^{test})^{-1} - 1 \right] N_{miss}(t) - N_{fa}(t) \right)$$

where $C_1$ and $C_2$ are constants, $N_{miss}(t)$ and $N_{fa}(t)$ are
the respective numbers of misses and false alarms at threshold t, and
$\pi_+^{test}$ is the ratio of positives in the test set. The threshold
that maximizes this quantity also minimizes the risk given by:

$$Risk(t) = C_{miss} \cdot N_{miss}(t) + C_{fa} \cdot N_{fa}(t)$$

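where $C_{fa} = 1$ and $C_{miss} = (K\pi_+^{test})^{-1} - 1$. Bayesian
decision theory indicates that in order to minimize this risk the
system should decide that the clip with scores $\mathbf{x}$ is a
positive if and only if

$$P(y = +1 \mid \mathbf{x}) \cdot C_{miss} > P(y = -1 \mid \mathbf{x}) \cdot C_{fa}$$

which defines a threshold on the log-likelihood ratio (LLR):

$$LLR = \log\left( \frac{P(\mathbf{x} \mid y = +1)}{P(\mathbf{x} \mid y = -1)} \right) > \log\left( \frac{C_{fa}}{C_{miss}} \right) - \mathrm{logit}(\pi_+^{test})$$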
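
The equivalence between the two forms of the metric can be checked
numerically. The Python sketch below computes it once from recall and
rank and once from the numbers of misses and false alarms. The
constants used in the second form, C1 = K/V and C2 = C_miss times the
number of positives, are one choice worked out for this sketch by
expanding the definition; they are not stated in the text above. The
toy scores and labels are likewise made up.

```python
def r0_direct(scores, labels, t, K=12.5):
    """Rec(t) - (K / V) * Rank(t), with Rank(t) = #clips scoring above t."""
    V = len(scores)
    n_pos = sum(labels)
    detected = [l for s, l in zip(scores, labels) if s > t]
    return sum(detected) / n_pos - K / V * len(detected)

def r0_from_errors(scores, labels, t, K=12.5):
    """Same quantity as C1 * (C2 - C_miss * N_miss(t) - N_fa(t))."""
    V = len(scores)
    n_pos = sum(labels)
    c_miss = 1.0 / (K * n_pos / V) - 1.0      # (K * pi_test)^-1 - 1
    n_miss = sum(1 for s, l in zip(scores, labels) if l == 1 and s <= t)
    n_fa = sum(1 for s, l in zip(scores, labels) if l == 0 and s > t)
    c1, c2 = K / V, n_pos * c_miss            # constants derived for this sketch
    return c1 * (c2 - c_miss * n_miss - n_fa)

# 100 clips with scores 0.00 .. 0.99; four of them are positives.
scores = [i / 100 for i in range(100)]
labels = [1 if i in (50, 90, 97, 99) else 0 for i in range(100)]

value_direct = r0_direct(scores, labels, t=0.965)
value_errors = r0_from_errors(scores, labels, t=0.965)
```

At this threshold three clips are retrieved, two of which are
positives, so both functions evaluate the same trade-off between
recall and the size of the retrieved list.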

This formulation comes in handy when using logistic regression
to fuse or calibrate scores. Indeed, it can be shown that with a
posterior of the form
$P(y = +1 \mid \mathbf{x}) = \sigma(\boldsymbol{\alpha}\mathbf{x}^T + \beta)$,
the following holds for the LLR:

$$LLR + \mathrm{logit}(\pi_+^{train}) = \boldsymbol{\alpha}\mathbf{x}^T + \beta$$

where $\pi_+^{train}$ is the ratio of positives in the training set.
Assuming that the scores $S = \boldsymbol{\alpha}\mathbf{x}^T + \beta$
at the output of the logistic regression are well calibrated, the
threshold $t_0$ that maximizes $R_0$ is therefore:

$$\begin{aligned}
t_0 &= \log\left( \frac{C_{fa}}{C_{miss}} \right) - \mathrm{logit}(\pi_+^{test}) + \mathrm{logit}(\pi_+^{train}) \\
    &= \mathrm{logit}(K\pi_+^{test}) - \mathrm{logit}(\pi_+^{test}) + \mathrm{logit}(\pi_+^{train})
\end{aligned}$$

It is worth noting that while $\pi_+^{train}$ is known from the
training data labels, $\pi_+^{test}$ might not be known, and the
difference between the assumed and the actual $\pi_+^{test}$ may result
in a sub-optimal threshold.
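
As a small worked example, the threshold can be computed directly from
the two priors. The Python sketch below implements the closed form and
compares it with the cost-based form. The priors are illustrative
values: 100 positives among 5,092 training clips, and an assumed test
prior of 75 positives among 24,957 clips. The function name and the
guard on K times the test prior are assumptions of this sketch.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def bayes_threshold(pi_test, pi_train, K=12.5):
    """t0 = logit(K * pi_test) - logit(pi_test) + logit(pi_train)."""
    if not 0.0 < K * pi_test < 1.0:
        # C_miss = (K * pi_test)^-1 - 1 must be positive for the log to exist.
        raise ValueError("K * pi_test must lie strictly between 0 and 1")
    return logit(K * pi_test) - logit(pi_test) + logit(pi_train)

pi_train = 100 / 5092      # illustrative training prior
pi_test = 75 / 24957       # assumed test prior
t0 = bayes_threshold(pi_test, pi_train)

# Cross-check against log(C_fa / C_miss) with C_fa = 1.
c_miss = 1.0 / (12.5 * pi_test) - 1.0
t0_from_costs = math.log(1.0 / c_miss) - logit(pi_test) + logit(pi_train)

def is_detection(calibrated_score, threshold=t0):
    """A clip is declared positive when its calibrated score exceeds t0."""
    return calibrated_score > threshold
```

Rerunning bayes_threshold with a different assumed test prior shows
how sensitive the operating point is to that assumption.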


5.  EXPERIMENTAL  RESULTS 

In  this  section,  we  first  describe  the  data  used  for  our  experiments 
and  then  present  results  on  system  fusion  using  two  separate  metrics. 

5.1.  Data 

We evaluated the performance of late fusion according to the NIST
TRECVID 2013 MED evaluation plan [27]. We used the 20
pre-specified events as our detection targets. We ran experiments in
two conditions with varying numbers of positives: 100Ex with 100
positive clips per event and 10Ex with 10 positive clips per event. An
extra set of 4,992 video clips labeled as negatives was used to
supplement the positives for each event. To maximize the use of this
limited amount of training data, we generated scores on the training
data for each modality using 10-fold cross-validation. These
cross-validation scores were used to train all of the normalization
and fusion parameters, as well as to choose thresholds. MED
performance is reported on MEDTEST, a set of 23,468 video clips labeled
as negatives plus 1,489 video clips labeled as positives, for an
average of 75 labeled positives per event.

5.2.  Results  and  discussion 

For logistic regression and weighted averaging, we reduced the
number of trained parameters by merging similar modalities using the
arithmetic mean in the posterior domain prior to learning the
fusion. Specifically, we fused all three visual modalities and
all four motion-based modalities to create two aggregate modalities.
For logistic regression in the 10Ex condition, we further averaged
those two aggregate modalities into a single modality.

Columns 2 and 5 of Table 1 present the mean average precision
(mAP) of individual sub-systems and of several fusion schemes on
the test set. In both training conditions, the logistic regression and
logistic averaging techniques significantly outperform the baseline
fusion techniques. These gains can be explained by the greater
flexibility of parametric approaches such as LR, LA and WM. Yet, the
two proposed approaches do not suffer from the over-fitting problems of
WM, either because the number of trained parameters is small (LA) or
because of regularization (LR). When 100 examples are available for
training, the best technique is LR+fs with a mAP of 0.434. This
result demonstrates the effectiveness of using binary indicators to
handle missing values, since LR-min+fs obtained a mAP of only 0.422.
Also, while logistic regression proves more effective than LA for
100Ex, this trend disappears in the 10Ex condition, where LA and
LR-min+fs both perform best. We believe that this result is directly
related to the design of LA as a fusion technique that does not require
many positives for training and is tuned to optimize mAP. In contrast,
logistic regression learns event-dependent weights that enable it to
perform well over a wide range of operating points. The
feature-selection component of logistic regression was found useful and
provided a significant mAP increase in both conditions. Modalities
were discarded for 7 events for 10Ex and 16 events for 100Ex.

We also compared the performance of the different fusion
strategies in terms of the $R_0$ metric. We considered two different
strategies to pick the threshold: (1) using the threshold $t_{tr}$ that
optimizes $R_0(t)$ on the training data, and (2) using the threshold
$t_0$ computed as in the theoretical analysis developed in Section 4.
The latter technique assumes that the fused scores are calibrated
likelihood ratios. Because this is only the case for the LR fusion, we
first calibrate the output of the other fusion strategies for each
event by using a pass of logistic regression with a 2-dimensional
feature vector. The two dimensions are set to the logit of the fused
score in the posterior domain and a binary indicator that is set to 1
if the score is missing. The logistic regression parameters are
estimated using the fused cross-validation scores on the training set.
The $\alpha$ and $\beta$ parameters of the logistic averaging approach
were adjusted to maximize $R_0(t_0)$ on the training data.

Results in terms of $\bar{R}_0$, the mean of $R_0$ over all 20 events,
are shown in Table 1. With this metric, logistic regression
consistently outperforms the other fusion approaches, for both
conditions. Logistic averaging remains competitive but no longer
outperforms LR-min+fs in the 10Ex condition. An advantage of logistic
regression fusion over other techniques for maximizing $R_0$ is that it
uses a feature vector with at least N dimensions, and therefore more
finely approximates the LLR. The resulting scores are better calibrated
than the scores obtained through a pass of logistic regression
following late fusion. Also, we should point out that the $t_0$
threshold obtained through Bayesian decision theory consistently
outperforms the more ad-hoc technique of picking $t_{tr}$ using the
training data. This is because the Bayesian formulation enables
computing the theoretically optimal threshold given the event detection
priors on the test data.


Table 1: Performance of various score fusion techniques on the test
set. Results are reported in both conditions in terms of mAP and $R_0$.
The best system for each metric is marked with an asterisk (*).

                          100Ex                                   10Ex
System       mAP      $R_0(t_{tr})$   $R_0(t_0)$   mAP      $R_0(t_{tr})$   $R_0(t_0)$
AM-zero      0.405    0.514           0.524        0.222    0.263           0.278
AM-mean      0.402    0.506           0.526        0.214    0.244           0.272
GM           0.356    0.483           0.487        0.206    0.148           0.038
WM-dep       0.404    0.511           0.532        0.197    0.259           0.297
WM-indep     0.414    0.503           0.522        0.213    0.288           0.298
LA           0.425    0.521           0.531        0.244*   0.287           0.304
LR-min       0.421    0.519           0.538        0.235    0.260           0.309
LR-min+fs    0.422    0.522           0.541        0.244*   0.288           0.314*
LR           0.428    0.532           0.537        0.230    0.262           0.312
LR+fs        0.434*   0.533*          0.546*       0.238    0.295*          0.312

6.  CONCLUSION  AND  FUTURE  WORK 

In this paper, we demonstrate the effectiveness of parametric
approaches to late fusion in a multimedia event detection system, even
in situations with limited training data. The proposed logistic
regression approach to score fusion handles missing scores and
automatically performs feature selection to discard poorly performing
modalities. A second technique, proposed under the name of
logistic averaging, can be seen as a pre-processing approach to the
arithmetic mean method: it performs Z-normalization in the logit
domain before mapping scores back to posteriors in a way that
maximizes a given metric. The logistic regression approach
significantly outperformed baseline techniques in terms of both the mAP
and the $R_0$ metric. Logistic averaging was very competitive for
optimizing mAP with limited training data, but did not perform as well
on the $R_0$ metric. These findings are comparable to results for other
detection tasks such as speaker identification or keyword spotting,
where logistic regression has consistently been found to be a robust
tool to combine systems and provide a calibrated output that can be
used to make binary decisions over a wide range of operating points.
Avenues for future work include applying the fusion techniques
introduced in this paper to the problem of query-based event detection,
where the event detection models are built from an event description
rather than learned from positive examples.